Abstract

While sparse coding-based clustering methods have been shown to be
successful, their bottlenecks in both efficiency and scalability limit
their practical usage. In recent years, deep learning has proved
to be a highly effective, efficient and scalable feature learning tool.
In this paper, we propose to emulate the sparse coding-based clustering
pipeline in the context of deep learning, leading to a carefully
crafted deep model benefiting from both. A feed-forward network
structure, named TAGnet, is constructed based on a graph-regularized
sparse coding algorithm. It is then trained with task-specific
loss functions from end to end. We discover that connecting
deep learning to sparse coding benefits not only the model performance,
but also its initialization and interpretation. Moreover, by
introducing auxiliary clustering tasks to the intermediate feature
hierarchy, we formulate DTAGnet and obtain a further performance
boost. Extensive experiments demonstrate that the proposed model
outperforms several state-of-the-art methods by remarkable margins.

1 Introduction 

Clustering aims to learn the hidden data patterns and group
similar structures in an unsupervised way. While many classical
clustering algorithms have been proposed, such as
K-means, Gaussian mixture model (GMM) clustering [?],
maximum-margin clustering [31] and information-theoretic
clustering [?], most only work well when the data dimensionality
is low. Since high-dimensional data exhibits dense
grouping in low-dimensional embeddings [23], researchers
have been motivated to first project the original data into a
low-dimensional subspace [?] and then cluster on the
feature embeddings. Among many feature embedding learning
methods, sparse codes [?] have proven to be robust and
efficient features for clustering, as verified by many works [?], [?].

Effectiveness and scalability are two major concerns in
designing a clustering algorithm under Big Data scenarios
[?]. Conventional sparse coding models rely on iterative
approximation algorithms, whose inherently sequential structure
as well as the data-dependent complexity and latency
often constitute a major bottleneck in computational
efficiency [?]. That also makes it difficult to jointly optimize
the unsupervised feature learning and the
supervised task-driven steps [20]. Such a joint optimization
usually has to rely on solving a complex bi-level optimization
[?], such as [29], which constitutes another efficiency
bottleneck. What is more, to effectively model and represent
datasets of growing sizes, sparse coding needs to refer to
larger dictionaries [?]. Since the inference complexity of
sparse coding increases more than linearly with respect to the
dictionary size [29], the scalability of sparse coding-based
clustering work turns out to be quite limited.

To conquer those limitations, we are motivated to introduce
the tool of deep learning into clustering, a direction that
has so far received little attention. The advantages of deep
learning come from its large learning capacity, its linear
scalability with the aid of stochastic gradient descent (SGD),
and its low inference complexity [?]. Feed-forward networks
can be naturally tuned jointly with task-driven loss
functions. On the other hand, generic deep architectures [?]
largely ignore problem-specific formulations and prior
knowledge. As a result, one may encounter difficulties in
choosing optimal architectures, interpreting their working
mechanisms, and initializing the parameters.

In this paper, we demonstrate how to incorporate the
sparse coding-based pipeline into deep learning models
for clustering. The proposed framework takes advantage
of both sparse coding and deep learning. Specifically, the
feature learning layers are inspired by the graph-regularized
sparse coding inference process, via reformulating iterative
algorithms [12] into a feed-forward network, named TAGnet.
Those layers are then jointly optimized with the task-specific
loss functions from end to end. Our technical novelty
and merits are threefold:

• As a deep feed-forward model, the proposed framework
provides an extremely efficient inference process and high
scalability to large-scale data. It allows learning more
descriptive features than conventional sparse codes.

• We discover that incorporating the expertise of sparse
code-based clustering pipelines [34], [29] improves our
performance significantly. Moreover, it greatly facilitates
the model initialization and interpretation.

• By further enforcing auxiliary clustering tasks on the
hierarchy of features, we develop DTAGnet and observe
further performance boosts on the CMU MultiPIE
dataset [13].
Figure 1: (a) The proposed pipeline, consisting of the TAGnet network for feature learning, followed by the clustering-oriented
loss functions. The parameters $W$, $S$, $\theta$ and $\omega$ are all learnt end-to-end from training data. (b) The block diagram
of solving (3.3).


2 Related Work 

2.1 Sparse coding for clustering Assume data samples
$X = [x_1, x_2, \cdots, x_n]$, where $x_i \in \mathbb{R}^{m \times 1}$,
$i = 1, 2, \cdots, n$. They are encoded into sparse codes
$A = [a_1, a_2, \cdots, a_n]$, where $a_i \in \mathbb{R}^{p \times 1}$,
$i = 1, 2, \cdots, n$, using a learned dictionary
$D = [d_1, d_2, \cdots, d_p]$, where $d_i \in \mathbb{R}^{m \times 1}$,
$i = 1, 2, \cdots, p$ are the learned atoms. The
sparse codes are obtained by solving the following convex
optimization ($\lambda$ is a constant):

(2.1) $A = \arg\min_A \frac{1}{2}\|X - DA\|_F^2 + \lambda \sum_i \|a_i\|_1$.

In [?], the authors suggested that the sparse codes can be
used to construct the similarity graph for spectral clustering
[22]. Furthermore, to capture the geometric structure of
local data manifolds, graph-regularized sparse codes are
further suggested in [34] by solving:

(2.2) $A = \arg\min_A \frac{1}{2}\|X - DA\|_F^2 + \lambda \sum_i \|a_i\|_1 + \frac{\alpha}{2}\mathrm{Tr}(A L A^T)$,

where $L$ is the graph Laplacian matrix and can be constructed
from a pre-chosen pairwise similarity (affinity) matrix
$P$. More recently in [29], the authors suggested simultaneously
learning feature extraction and discriminative clustering,
by formulating a task-driven sparse coding model [20].
They showed that such joint methods consistently outperformed
non-joint counterparts.
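
As an illustration of how the Laplacian $L$ in (2.2) can be built from an affinity matrix $P$, the following is a minimal NumPy sketch. It assumes a Gaussian-kernel affinity with a bandwidth called `delta` and the unnormalized Laplacian (degree matrix minus $P$); the function names, the exact kernel form and the Laplacian convention are assumptions made for illustration, not details fixed by this section. The sketch also evaluates the pairwise form $\frac{1}{2}\sum_{ij} P_{ij}\|a_i - a_j\|^2$, which for a symmetric $P$ equals $\mathrm{Tr}(A L A^T)$ and shows why the trace term penalizes codes that differ between similar samples.

```python
import numpy as np


def gaussian_affinity(X, delta):
    """Pairwise affinity between the columns of X (one sample per column).

    Assumed form: P_ij = exp(-||x_i - x_j||^2 / delta^2), with zero diagonal.
    """
    sq_norms = (X ** 2).sum(axis=0)
    sq_dists = sq_norms[:, None] + sq_norms[None, :] - 2.0 * (X.T @ X)
    sq_dists = np.maximum(sq_dists, 0.0)  # guard against tiny negative values
    P = np.exp(-sq_dists / delta ** 2)
    np.fill_diagonal(P, 0.0)
    return P


def graph_laplacian(P):
    """Unnormalized graph Laplacian: degree matrix minus affinity (assumed)."""
    return np.diag(P.sum(axis=1)) - P


def smoothness(A, L):
    """Graph regularizer Tr(A L A^T) for codes A (one code per column)."""
    return np.trace(A @ L @ A.T)


rng = np.random.default_rng(0)
X = rng.standard_normal((5, 8))   # 8 toy samples of dimension 5
A = rng.standard_normal((3, 8))   # 8 toy codes of dimension 3
P = gaussian_affinity(X, delta=2.0)
L = graph_laplacian(P)

# Pairwise form of the same quantity: 0.5 * sum_ij P_ij * ||a_i - a_j||^2
diffs = A[:, :, None] - A[:, None, :]
pairwise = 0.5 * (P * (diffs ** 2).sum(axis=0)).sum()
```

Here `smoothness(A, L)` and `pairwise` agree up to floating-point error, so a larger $\alpha$ in (2.2) pulls the codes of strongly connected samples closer together.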

2.2 Deep learning for clustering In [26], the authors explored
the possibility of employing deep learning in graph
clustering. They first learned a nonlinear embedding of the
original graph by an auto encoder (AE), followed by a K-means
algorithm on the embedding to obtain the final clustering
result. However, it neither exploits more adapted deep
architectures nor performs any task-specific joint optimization.
In [?], a deep belief network (DBN) [?] with nonparametric
clustering was presented. As a generative graphical
model, DBN provides faster feature learning, but is
less effective than AEs in terms of learning discriminative
features for clustering. In [27], the authors extended the
semi non-negative matrix factorization (Semi-NMF) model
[?] to a Deep Semi-NMF model, whose architecture resembles
stacked AEs. Our proposed model is substantially different
from all these previous approaches, due to its unique
task-specific architecture derived from sparse coding domain
expertise, as well as the joint optimization with clustering-oriented
loss functions.

3 Model Formulation 

The proposed pipeline consists of two blocks. As depicted
in Fig. 1(a), it is trained end-to-end in an unsupervised way.
It includes a feed-forward architecture, termed Task-specific
And Graph-regularized Network (TAGnet), to learn discriminative
features, and the clustering-oriented loss function.

3.1 TAGnet: Task-specific And Graph-regularized Network
Different from generic deep architectures, TAGnet is
designed to take advantage of the successful sparse
code-based clustering pipelines [34], [29]. It aims to learn
features that are optimized under clustering criteria, while
encoding graph constraints (2.2) to regularize the target
solution. TAGnet is derived from the following theorem:

Theorem 3.1. The optimal sparse code $A$ from (2.2) is the
fixed point of

(3.3) $A = h_{\lambda/N}\left[(I - \frac{1}{N} D^T D) A - A (\frac{\alpha}{N} L) + \frac{1}{N} D^T X\right]$,

where $h_\theta$ is an element-wise shrinkage function parameterized
by $\theta$:

(3.4) $[h_\theta(u)]_i = \mathrm{sign}(u_i)(|u_i| - \theta_i)_+$.

$N$ is an upper bound on the largest eigenvalue of $D^T D$.

The complete proof of Theorem 3.1 can be found in the
supplementary. Theorem 3.1 outlines an iterative algorithm to
solve (2.2). Under quite mild conditions [?], after $A$ is
initialized, one may repeat the shrinkage and thresholding
process in (3.3) until convergence. Moreover, the iterative
algorithm could be alternatively expressed as the block diagram
in Fig. 1(b), where

(3.5) $W = \frac{1}{N} D^T, \quad S = I - \frac{1}{N} D^T D, \quad \theta = \frac{\lambda}{N}$.
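
The iteration described by Theorem 3.1 can be sketched in a few lines of NumPy. The code below is an illustrative sketch under stated assumptions, not the authors' implementation: the function names are hypothetical, the update is started from $A = 0$, and the constant $N$ is chosen conservatively as $\lambda_{\max}(D^T D) + \alpha\,\lambda_{\max}(L)$. That choice is still an upper bound on the largest eigenvalue of $D^T D$, as the theorem requires, and additionally keeps the step small enough that the objective of (2.2) does not increase from one update to the next.

```python
import numpy as np


def shrink(u, theta):
    """Element-wise shrinkage: sign(u) * max(|u| - theta, 0), as in (3.4)."""
    return np.sign(u) * np.maximum(np.abs(u) - theta, 0.0)


def objective(X, D, A, L, lam, alpha):
    """Value of the graph-regularized sparse coding objective (2.2)."""
    fit = 0.5 * np.linalg.norm(X - D @ A, "fro") ** 2
    return fit + lam * np.abs(A).sum() + 0.5 * alpha * np.trace(A @ L @ A.T)


def params_from_dictionary(D, L, lam, alpha):
    """Return W, S, theta as in (3.5), together with the constant N.

    Assumption of this sketch: N also absorbs alpha * lambda_max(L), which
    keeps it an upper bound on the largest eigenvalue of D^T D.
    """
    N = np.linalg.eigvalsh(D.T @ D).max() + alpha * np.linalg.eigvalsh(L).max()
    W = D.T / N
    S = np.eye(D.shape[1]) - (D.T @ D) / N
    theta = lam / N
    return W, S, theta, N


def iterate(X, W, S, theta, L, alpha_over_N, n_iter):
    """Apply the update of (3.3) n_iter times, starting from A = 0."""
    B = W @ X                      # (1/N) D^T X, computed once
    A = np.zeros_like(B)
    for _ in range(n_iter):
        A = shrink(S @ A - alpha_over_N * (A @ L) + B, theta)
    return A


# Toy problem: 20 samples of dimension 20, 6 atoms, chain-graph Laplacian.
rng = np.random.default_rng(0)
n, m, p = 20, 20, 6
X = rng.standard_normal((m, n))
D = rng.standard_normal((m, p))
D /= np.linalg.norm(D, axis=0)     # unit-norm atoms
P = np.diag(np.ones(n - 1), 1)
P = P + P.T
L = np.diag(P.sum(axis=1)) - P
lam, alpha = 0.1, 0.5

W, S, theta, N = params_from_dictionary(D, L, lam, alpha)
A_few = iterate(X, W, S, theta, L, alpha / N, n_iter=3)
A_many = iterate(X, W, S, theta, L, alpha / N, n_iter=2000)
```

Running only a handful of updates, as in `A_few`, gives a rough approximation of the converged codes `A_many`; this truncated computation is the part that a feed-forward network with parameters $W$, $S$ and $\theta$ can reproduce.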


In particular, we define the new operator “×L”: $A \rightarrow -\frac{\alpha}{N} A L$,
where the input $A$ is multiplied by the pre-fixed
$L$ from the right side and scaled by the constant $-\frac{\alpha}{N}$.

By time-unfolding and truncating Fig. 1(b) to a fixed
number of $K$ iterations ($K = 2$ by default)^1, we obtain the
TAGnet form in Fig. 1(a). $W$, $S$ and $\theta$ are all to be learnt
jointly from data. $S$ and $\theta$ are tied weights for both stages.^2
It is important to note that the output $A$ of TAGnet is not
necessarily identical to the predicted sparse codes by solving
(2.2). Instead, the goal of TAGnet is to learn a discriminative
embedding that is optimal for clustering.

To facilitate training, we further rewrite (3.4) as:

(3.6) $[h_\theta(u)]_i = \theta_i \cdot \mathrm{sign}(u_i)(|u_i|/\theta_i - 1)_+ = \theta_i h_1(u_i/\theta_i)$.

Eqn. (3.6) indicates that the original neuron with trainable
thresholds can be decomposed into two linear scaling layers
plus a unit-threshold neuron. The weights of the two scaling
layers are diagonal matrices defined by $\theta$ and its element-wise
reciprocal, respectively.

A notable component in TAGnet is the ×L branch of
each stage. The graph Laplacian $L$ could be computed in
advance. In the feed-forward process, a ×L branch takes the
intermediate $Z_k$ ($k = 1, 2$) as the input, and applies the “×L”
operator defined above. The output is aggregated with the
output from the learnable $S$ layer. In the back propagation,
$L$ will not be altered. In such a way, the graph regularization
is effectively encoded in the TAGnet structure as a prior.

An appealing highlight of (D)TAGnet lies in its very effective
and straightforward initialization strategy. With sufficient
data, many recent deep networks train well with random
initializations without pre-training. However, it has been
discovered that poor initializations hamper the effectiveness of
first-order methods (e.g., SGD) in certain cases [25]. For
(D)TAGnet, it is however much easier to initialize the model
in the right regime. That benefits from the analytical
relationships between sparse coding and network hyperparameters
defined in (3.5): we could initialize deep models from
corresponding sparse coding components, the latter of which
are easier to obtain. Such an advantage becomes much more
important when the training data is limited.

^1 We tested larger $K$ values (3 or 4), but they did not bring noticeable
performance improvements in our clustering cases.

^2 Out of curiosity, we have also tried the architecture that treats $W$, $S$ and
$\theta$ in both stages as independent variables. We find that sharing parameters
improves the performance.

3.2 Clustering-oriented loss functions Assume $K$
clusters, and let $\omega = [\omega_1, \cdots, \omega_K]$ be the set of parameters of
the loss function, where $\omega_i$ corresponds to the $i$-th cluster,
$i = 1, 2, \cdots, K$. In this paper, we adopt the following two forms
of clustering-oriented loss functions.

One natural choice of the loss function is extended from
the popular softmax loss, and takes the entropy-like form:


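(3.7) $C(A, \omega) = -\sum_{i=1}^{n} \sum_{j=1}^{K} p_{ij} \log p_{ij}$,

where $p_{ij}$ denotes the probability that sample $x_i$ belongs
to cluster $j$, $i = 1, 2, \cdots, n$ and $j = 1, 2, \cdots, K$:

(3.8) $p_{ij} = p(j | \omega, a_i) = \frac{e^{-\omega_j^T a_i}}{\sum_{l=1}^{K} e^{-\omega_l^T a_i}}$.

In testing, the predicted cluster label of input $a_i$ is determined
using the maximum likelihood criterion based on the
predicted $p_{ij}$.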
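
A minimal NumPy sketch of this entropy-like loss follows. It assumes a softmax over the scores $-\omega_j^T a_i$ (the sign convention is an assumption of this sketch) and sums the per-sample entropies; the function names are hypothetical.

```python
import numpy as np


def cluster_probabilities(A, omega):
    """Soft assignments p_ij for codes A (p x n) and parameters omega (p x K).

    Assumed form: softmax over the scores -omega_j^T a_i.
    """
    scores = -(A.T @ omega)                           # shape (n, K)
    scores = scores - scores.max(axis=1, keepdims=True)  # numerical stability
    expo = np.exp(scores)
    return expo / expo.sum(axis=1, keepdims=True)


def entropy_loss(A, omega, eps=1e-12):
    """Sum over samples of the entropy of their cluster assignments."""
    probs = cluster_probabilities(A, omega)
    return -(probs * np.log(probs + eps)).sum()


def predict(A, omega):
    """Most likely cluster for each sample."""
    return cluster_probabilities(A, omega).argmax(axis=1)
```

The loss is largest, $n \log K$, when every sample is spread uniformly over the clusters (for example with `omega` set to zero), and it shrinks as assignments become more confident, which is the behaviour an entropy-minimization objective rewards.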

The maximum margin clustering (MMC) approach was
proposed in [31]. MMC finds a way to label the samples by
running an SVM implicitly, and the SVM margin obtained
would be maximized over all possible labelings [?]. By
referring to the MMC definition, the authors of [29] designed
the max-margin loss:

(3.9) $C(A, \omega) = \frac{\lambda}{2}\|\omega\|^2 + \sum_{i=1}^{n} C(a_i, \omega)$.

In the above equation, the loss for an individual sample $a_i$ is
defined as:

(3.10) $C(a_i, \omega) = \max(0, 1 + f^{r_i}(a_i) - f^{y_i}(a_i))$, with $f^j(a_i) = \omega_j^T a_i$, $y_i = \arg\max_{j} f^j(a_i)$, $r_i = \arg\max_{j \neq y_i} f^j(a_i)$,

where $\omega_j$ is the prototype for the $j$-th cluster. In testing, the
predicted cluster label of input $a_i$ is determined by the weight
vector that achieves the maximum $\omega_j^T a_i$.
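
The max-margin loss can be sketched as follows, under the reading that each sample's score for cluster $j$ is $\omega_j^T a_i$, that $y_i$ is the best-scoring cluster and $r_i$ the runner-up. The function name is hypothetical, and `reg` is a placeholder for the coefficient of the $\|\omega\|^2$ term; both are assumptions of this sketch.

```python
import numpy as np


def max_margin_loss(A, omega, reg):
    """Max-margin clustering loss for codes A (p x n) and prototypes omega (p x K).

    Returns the loss value and the predicted label of every sample.
    Requires K >= 2 so that a runner-up cluster exists.
    """
    F = (omega.T @ A).astype(float)        # F[j, i] = omega_j^T a_i
    n = A.shape[1]
    cols = np.arange(n)
    y = F.argmax(axis=0)                   # best-scoring cluster per sample
    top = F[y, cols]
    F_rest = F.copy()
    F_rest[y, cols] = -np.inf              # mask the winner
    runner_up = F_rest.max(axis=0)         # best score among the other clusters
    hinge = np.maximum(0.0, 1.0 + runner_up - top)
    return 0.5 * reg * np.sum(omega ** 2) + hinge.sum(), y
```

A sample contributes zero hinge loss once its best score exceeds the runner-up by a margin of at least one, so the loss pushes the prototypes toward well-separated assignments while the norm term keeps them from growing without bound.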

Model Complexity The proposed framework can handle
large-scale and high-dimensional data effectively via the
stochastic gradient descent (SGD) algorithm. In each step,
the back propagation procedure requires only operations of
order $O(p)$. The training algorithm takes $O(Cnp)$ time
($C$ is a constant in terms of the total numbers of epochs, stage
numbers, etc.). In addition, SGD is easy to parallelize,
and thus the model could be efficiently trained using GPUs.

3.3 Connections to Existing Models There is a close
connection between sparse coding and neural networks. In [12],
a feed-forward neural network, named LISTA, is proposed
to efficiently approximate the sparse code $a$ of input signal
$x$, which is obtained by solving (2.1) in advance. The LISTA
network learns the hyperparameters as a general regression
model from training data to their pre-solved sparse codes
using back-propagation.

LISTA overlooks the useful geometric information
among data points [34], and therefore could be viewed as
a special case of TAGnet in Fig. 1 when $\alpha = 0$ (i.e., removing
the ×L branches). Moreover, LISTA aims to approximate
the “optimal” sparse codes pre-obtained from (2.1),
and therefore requires the estimation of $D$ and the tedious
pre-computation of $A$. The authors did not exploit its
potential in supervised and task-specific feature learning.
4 A Deeper Look: Hierarchical Clustering by DTAGnet

Deep networks are well known for their capabilities to learn
semantically rich representations by hidden layers [?]. In
this section, we investigate how the intermediate features
$Z_k$ ($k = 1, 2$) in TAGnet (Fig. 1(a)) can be interpreted, and
further utilized to improve the model, for specific clustering
tasks. Compared to related non-deep models [?], such a
hierarchical clustering property is another unique advantage
of being deep.

Our strategy is mainly inspired by the algorithmic
framework of deeply supervised nets [?]. As in Fig. 2,
our proposed Deeply-Task-specific And Graph-regularized
Network (DTAGnet) brings in additional deep feedbacks,
by associating a clustering-oriented local auxiliary loss
$C_k(Z_k, \omega_k)$ ($k = 1, 2$) with each stage. Such an auxiliary
loss takes the same form as the overall $C(A, \omega)$, except that
the expected cluster number may be different, depending on
the auxiliary clustering task to be performed. The DTAGnet
backpropagates errors not only from the overall loss layer,
but also simultaneously from the auxiliary losses.
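
One simple way to read this is as a single training objective. The unweighted sum below is an assumption of this sketch, since the text does not state how the overall and auxiliary losses are balanced; the symbol $\mathcal{C}_{\mathrm{total}}$ is introduced here only for illustration. The second expression spells out that the gradient reaching an intermediate feature $Z_k$ collects contributions from the overall loss and from every auxiliary loss attached at or after stage $k$, with all derivatives taken through the network by the chain rule.

```latex
\mathcal{C}_{\mathrm{total}} = C(A, \omega) + \sum_{k=1}^{2} C_k(Z_k, \omega_k),
\qquad
\frac{\partial \mathcal{C}_{\mathrm{total}}}{\partial Z_k}
  = \frac{\partial C(A, \omega)}{\partial Z_k}
  + \sum_{k' \ge k} \frac{\partial C_{k'}(Z_{k'}, \omega_{k'})}{\partial Z_k}
```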

While seeking the optimal performance of the target
clustering, DTAGnet is also driven by two auxiliary tasks
that are explicitly targeted at clustering specific attributes. It
enforces a constraint at each hidden representation for directly
making a good cluster prediction. In addition to the overall
loss, the introduction of auxiliary losses gives another
strong push to obtain discriminative and sensible features
at each individual stage. As discovered in the classification
experiments in [?], the auxiliary loss both acts as feature
regularization to reduce generalization errors and results in
faster convergence. We also find in Section 5 that every
$Z_k$ ($k = 1, 2$) is indeed most suited for its targeted task.

In [27], a Deep Semi-NMF model was proposed to learn
hidden representations that lend themselves to an interpretation
of clustering according to different attributes. The authors
considered the problem of mapping facial images to
their identities. A face image also contains attributes like
pose and expression that help identify the person depicted. In
their experiments, the authors found that by further factorizing
this mapping in a way that each factor adds an extra layer
of abstraction, the deep model could automatically learn latent
intermediate representations that are suited for clustering
identity-related attributes. Although there is a clustering
interpretation, those hidden representations are not
specifically optimized in the clustering sense. Instead, the
entire model is trained with only the overall reconstruction
loss, after which clustering is performed using K-means on
the learnt features. Consequently, their clustering performance
is not satisfactory. Our study shares a similar observation
and motivation with [27], but in a more task-specific manner,
by performing the optimizations of auxiliary clustering tasks
jointly with the overall task.

Figure 2: The DTAGnet architecture, taking the CMU MultiPIE
dataset as an example. The model is able to simultaneously
learn features for pose clustering ($Z_1$), for expression
clustering ($Z_2$), and for identity clustering ($A$). The first two
attributes are related to and helpful for the last (overall) task.
Part of the image sources are taken from [?] and [?].

5 Experiment Results 

5.1 Datasets and measurements We evaluate the proposed
model on three publicly available datasets:

• MNIST [?] consists of a total number of 70,000
quasi-binary, handwritten digit images, with digits 0 to
9. The digits are normalized and centered in fixed-size
images of 28 × 28.

• CMU MultiPIE [13] contains around 750,000 images
of 337 subjects, captured under varied laboratory
conditions. A unique property of CMU MultiPIE
lies in that each image comes with labels for the identity,
illumination, pose and expression attributes. That
is why CMU MultiPIE is chosen in [27] to learn
multi-attribute features (Fig. 2) for hierarchical clustering. In
our experiments, we follow [27] and adopt a subset of
13,230 images of 147 subjects in 5 different poses and
6 different emotions. Notably, we do not pre-process
the images by using piece-wise affine warping as utilized
by [27] to align these images.

• COIL20 [?] contains 1,440 32 × 32 gray scale
images of 20 objects (72 images per object). The
images of each object were taken 5 degrees apart.

Although the paper only evaluates the proposed method
using image datasets, the methodology itself is not limited
to image subjects. We apply two widely-used measures
to evaluate the clustering performances: the accuracy and the
Normalized Mutual Information (NMI) [?], [?]. We follow
the convention of many clustering works [34], [32], [29], and do
not distinguish training from testing. We train our models on
all available samples of each dataset, reporting the clustering
performances as our testing results. Results are averaged
from 5 independent runs.

5.2 Experiment settings The proposed networks are implemented
using the cuda-convnet package [?]. The network
takes $K = 2$ stages by default. We apply a constant
learning rate of 0.01 with no momentum to all trainable
layers. The batch size is 128. In particular, to encode
graph regularization as a prior, we fix $L$ during model training
by setting its learning rate to be 0. Experiments run
on a workstation with 12 Intel Xeon 2.67GHz CPUs and 1
GTX680 GPU. The training takes approximately 1 hour on
the MNIST dataset. It is also observed that the training
efficiency of our model scales approximately linearly with data.

In our experiments, we set the default value of $\alpha$ to be
5, $p$ to be 128, and $\lambda$ to be chosen from [0.1, 1] by
cross-validation.^3 A dictionary $D$ is first learned from $X$ by
K-SVD [?]. $W$, $S$ and $\theta$ are then initialized based on (3.5).
$L$ is also pre-calculated from $P$, which is formulated by the
Gaussian kernel: $P_{ij} = \exp(-\|x_i - x_j\|_2^2 / \delta^2)$ ($\delta$ selected
by cross-validation). After obtaining the output $A$ from the
initial (D)TAGnet models, $\omega$ (or $\omega_k$) could be initialized
based on minimizing (3.7) or (3.9) over $A$ (or $Z_k$).

^3 The default values of $\alpha$ and $p$ are inferred from the related sparse coding
literature [34], and validated in experiments.

5.3 Comparison experiments and analysis

5.3.1 Benefits of the task-specific deep architecture

We denote the proposed model of TAGnet plus entropy-minimization
loss (EML) (3.7) as TAGnet-EML, and the one
plus maximum-margin loss (MML) (3.9) as TAGnet-MML,
respectively. We include the following comparison methods:

• We refer to the initializations of the proposed joint
models as their “Non-Joint” counterparts, denoted as
NJ-TAGnet-EML and NJ-TAGnet-MML (NJ short for
non-joint), respectively.

• We design a Baseline Encoder (BE), which is a
fully-connected feedforward network, consisting of three
hidden layers of dimension $p$ with ReLU neurons. The
BE has the same parameter complexity
as TAGnet.^4 The BEs are also tuned by EML or
MML in the same way, denoted as BE-EML or BE-MML,
respectively. We intend to verify our important
argument, that the proposed model benefits from the
task-specific TAGnet architecture, rather than just the
large learning capacity of generic deep models.

• We compare the proposed models with their closest
“shallow” competitors, i.e., the joint optimization
methods of graph-regularized sparse coding and
discriminative clustering in [29]. We re-implement their
work using both the (3.7) and (3.9) losses, denoted as SC-EML
and SC-MML (SC short for sparse coding). Since
in [29] the authors already revealed that SC-MML outperforms
classical methods such as MMC and $\ell_1$-graph
methods, we do not compare with them again.

• We also include Deep Semi-NMF [27] as a state-of-the-art
deep learning-based clustering work. We mainly
compare our results with their reported performances
on CMU MultiPIE.^5

^4 Except for the “$\theta$” layers, each of which contains only $p$ free parameters
and is thus ignored.

Figure 3: The accuracy and NMI plots of TAGnet-EML/TAGnet-MML
on MNIST, starting from the initialization,
and tested every 100 iterations. The accuracy and NMI
of SC-EML/SC-MML are also plotted as baselines.

Figure 4: The clustering accuracy and NMI plots (x-axis in logarithmic scale) of TAGnet-EML/TAGnet-MML versus the
parameter choices of $\alpha$, on: (a)(b) MNIST; (c)(d) CMU MultiPIE; (e)(f) COIL20.

As revealed by the full comparison results in Table 1, the
proposed task-specific deep architectures outperform the others
by a noticeable margin. The underlying domain expertise
guides the data-driven training in a more principled way. In
contrast, the “general-architecture” baseline encoders (BE-EML
and BE-MML) appear to produce much worse (even the
worst) results. Furthermore, it is evident that the proposed
end-to-end optimized models outperform their “non-joint”
counterparts. For example, on the MNIST dataset, TAGnet-MML
surpasses NJ-TAGnet-MML by around 4% in accuracy
and 5% in NMI.
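
For reference, the two measures quoted here can be computed as in the following sketch. It makes two assumptions that this section does not spell out: accuracy is taken under the best one-to-one relabeling of the predicted clusters, and NMI is normalized by the geometric mean of the two label entropies. The best relabeling is found by brute force over permutations, which only suits a small number of clusters; an assignment solver such as the Hungarian algorithm would be the usual replacement for larger problems. All function names are hypothetical.

```python
import itertools
import numpy as np


def contingency(y_true, y_pred):
    """Square count matrix: rows are predicted clusters, columns true classes."""
    _, t = np.unique(np.asarray(y_true), return_inverse=True)
    _, c = np.unique(np.asarray(y_pred), return_inverse=True)
    k = int(max(t.max(), c.max())) + 1
    counts = np.zeros((k, k), dtype=int)
    np.add.at(counts, (c, t), 1)
    return counts


def clustering_accuracy(y_true, y_pred):
    """Fraction of samples matched under the best one-to-one relabeling."""
    counts = contingency(y_true, y_pred)
    k = counts.shape[0]
    best = max(counts[np.arange(k), list(perm)].sum()
               for perm in itertools.permutations(range(k)))
    return best / counts.sum()


def nmi(y_true, y_pred):
    """Mutual information divided by the geometric mean of the entropies."""
    joint = contingency(y_true, y_pred) / len(y_true)
    p_pred, p_true = joint.sum(axis=1), joint.sum(axis=0)
    nz = joint > 0
    mi = (joint[nz] * np.log(joint[nz] / np.outer(p_pred, p_true)[nz])).sum()
    h_pred = -(p_pred[p_pred > 0] * np.log(p_pred[p_pred > 0])).sum()
    h_true = -(p_true[p_true > 0] * np.log(p_true[p_true > 0])).sum()
    if h_pred == 0.0 or h_true == 0.0:
        return 1.0 if h_pred == h_true else 0.0
    return mi / np.sqrt(h_pred * h_true)
```

Both measures are invariant to how the predicted clusters are numbered, which is what makes them usable when no correspondence between cluster indices and ground-truth classes is given.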

By comparing TAGnet-EML/TAGnet-MML with
SC-EML/SC-MML, we draw a promising conclusion:
adopting a more parameterized deep architecture allows a
larger feature learning capacity compared to conventional
sparse coding. Although similar points are well made in
many other fields [?], we are interested in a closer look
between the two. Fig. 3 plots the clustering accuracy and
NMI curves of TAGnet-EML/TAGnet-MML on the MNIST
dataset, along with iteration numbers. Each model is well
initialized at the very beginning, and the clustering accuracy
and NMI are computed every 100 iterations. At first,
the clustering performances of the deep models are even slightly
worse than those of the sparse coding methods, mainly since the
initialization of TAGnet hinges on a truncated approximation of
graph-regularized sparse coding. After a small number of
iterations, the performance of the deep models surpasses that of
the sparse coding ones, and continues rising monotonically until
reaching a higher plateau.

^5 With various component numbers tested in [27], we choose their best
cases (60 components).


5.3.2 Effects of graph regularization In (2.2), the graph
regularization term imposes stronger smoothness constraints
on the sparse codes with a larger $\alpha$. The same holds for
TAGnet. We investigate how the clustering performances
of TAGnet-EML/TAGnet-MML are influenced by various
$\alpha$ values. From Fig. 4, we observe the same general
tendency on all three datasets. As $\alpha$ increases, the
accuracy/NMI result first rises and then decreases, with the
peak appearing for $\alpha \in [5, 10]$. As an interpretation,
the local manifold information is not sufficiently encoded
when $\alpha$ is too small ($\alpha = 0$ completely disables the ×L
branch of TAGnet, and reduces it to the LISTA network [12]
fine-tuned by the losses). On the other hand, when $\alpha$ is large,
the sparse codes are “over-smoothened” with a reduced
discriminative ability. Note that similar phenomena are
also reported in other relevant literature, e.g., [34], [29].

Table 1: Accuracy and NMI performance comparisons on all three datasets

Dataset        Measure   TAGnet-EML   TAGnet-MML   NJ-TAGnet-EML   NJ-TAGnet-MML   BE-EML   BE-MML   SC-EML   SC-MML   Deep Semi-NMF
MNIST          Acc       0.6704       0.6922       0.6472          0.5052          0.5401   0.6521   0.6550   0.6784   /
MNIST          NMI       0.6261       0.6511       0.5624          0.6067          0.5002   0.5011   0.6150   0.6451   /
CMU MultiPIE   Acc       0.2176       0.2347       0.1727          0.1861          0.1204   0.1451   0.2002   0.2090   0.17
CMU MultiPIE   NMI       0.4338       0.4555       0.3167          0.3284          0.2672   0.2821   0.3337   0.3521   0.36
COIL20         Acc       0.8553       0.8991       0.7432          0.7882          0.7441   0.7645   0.8225   0.8658   /
COIL20         NMI       0.9090       0.9277       0.8707          0.8814          0.8028   0.8321   0.8850   0.9127   /

Figure 5: The clustering accuracy and NMI plots of TAGnet-EML/TAGnet-MML
versus the cluster number $N_c$ ranging
from 2 to 10, on MNIST.

Furthermore, comparing Fig. 4 (a)-(f), it
is noteworthy how graph regularization behaves
differently on the three datasets. We notice that the COIL20
dataset is the most sensitive to the choice of
$\alpha$. Increasing $\alpha$ from 0.01 to 50 leads to an improvement
of more than 10%, in terms of both accuracy and NMI. It
verifies the significance of graph regularization when training
samples are limited [32]. On the MNIST dataset, both
models obtain a gain of up to 6% in accuracy and 5% in
NMI, by tuning $\alpha$ from 0.01 to 10. However, unlike COIL20,
which almost always favors larger $\alpha$, the model performance on
the MNIST dataset tends to be not only saturated, but even
significantly hampered when $\alpha$ continues rising to 50. The
CMU MultiPIE dataset witnesses moderate improvements
of around 2% in both measurements. It is not as sensitive
to $\alpha$ as the other two. Potentially, this might be due to
the complex variability in the original images, which makes the
graph unreliable for estimating the underlying manifold
geometry. We suspect that more sophisticated graphs may
help alleviate the problem, and will explore this in the future.


5.3.3 Scalability and robustness On the MNIST dataset,
we re-conduct the clustering experiments with the cluster
number $N_c$ ranging from 2 to 10, using TAGnet-EML/TAGnet-MML.
Fig. 5 shows how the clustering accuracy
and NMI change with the number of clusters. The
clustering performance changes smoothly and robustly when
the task scale changes.

To examine the proposed models’ robustness to noise,
we add Gaussian noise, whose standard deviation
$s$ ranges from 0 (noiseless) to 0.3, and re-train our MNIST
model. Fig. 6 indicates that both TAGnet-EML and TAGnet-MML
possess certain robustness to noise. When $s$ is less
than 0.1, there is little visible performance degradation.
While TAGnet-MML constantly outperforms TAGnet-EML
in all experiments (as MMC is well known to be highly
discriminative [31]), it is interesting to observe in Fig. 6 that
the latter is slightly more robust to noise than the former.
It is perhaps owing to the probability-driven loss form (3.7)
of EML, which allows for more flexibility.


5.4 Hierarchical clustering on CMU MultiPIE As observed,
CMU MultiPIE is very challenging for the basic
identity clustering task. However, it comes with several
other attributes: pose, expression, and illumination, which
could be of assistance in our proposed DTAGnet framework.
In this section, we apply a setting similar to that of [27] on the
same CMU MultiPIE subset, by setting pose clustering as
the Stage I auxiliary task, and expression clustering as the
Stage II auxiliary task.^6 In that way, we target $C_1(Z_1, \omega_1)$ at
5 clusters, $C_2(Z_2, \omega_2)$ at 6 clusters, and finally $C(A, \omega)$ at
147 clusters.

The training of DTAGnet-EML/DTAGnet-MML follows
the same aforementioned process except for considering
extra back-propagated gradients from task $C_k(Z_k, \omega_k)$
in Stage $k$ ($k = 1, 2$). After that, we test each $C_k(Z_k, \omega_k)$
separately on its targeted task. In DTAGnet, each auxiliary
task is also jointly optimized with its intermediate feature
$Z_k$, which differentiates our methodology substantially from
[27]. It is thus no surprise to see in Table 2 that each auxiliary
task obtains much improved performance over [27].^7
Most notably, the performance of the overall identity clustering
task witnesses a very impressive boost of around 7%
in accuracy. We also test DTAGnet-EML/DTAGnet-MML
with only $C_1(Z_1, \omega_1)$ or $C_2(Z_2, \omega_2)$ kept. Experiments verify
that by adding auxiliary tasks gradually, the overall task
keeps benefiting. Those auxiliary tasks, when enforced
together, can also reinforce each other mutually.

^6 In fact, although claimed to be applicable to multiple attributes, [27]
only examined the first level features for pose clustering without considering
expressions, since it relied on a warping technique to pre-process images,
which gets rid of most expression variability.

^7 In Table 2 of [27], it is reported that the best accuracy of the pose clustering task
falls around 28%, using the most suited layer features.

Figure 6: The clustering accuracy and NMI plots of TAGnet-EML/TAGnet-MML
versus the noise level $s$, on MNIST.


Table 2: Effects of incorporating auxiliary clustering tasks in
DTAGnet-EML/DTAGnet-MML (P: Pose; E: Expression; I:
Identity)

Method        Stage I           Stage II          Overall
              Task   Acc        Task   Acc        Task   Acc
DTAGnet-EML   /      /          /      /          I      0.2176
              P      0.5067     /      /          I      0.2303
              /      /          E      0.3676     I      0.2507
              P      0.5407     E      0.4027     I      0.2833
DTAGnet-MML   /      /          /      /          I      0.2347
              P      0.5251     /      /          I      0.2635
              /      /          E      0.3988     I      0.2858
              P      0.5538     E      0.4231     I      0.3021


One might wonder which matters more for
the performance boost: the deeply task-specific architecture
that brings extra discriminative feature learning, or the
proper design of auxiliary tasks that capture the intrinsic data
structure characterized by attributes?

Table 3: Effects of varying target cluster numbers of auxiliary
tasks in DTAGnet-EML/DTAGnet-MML

Method        #clusters in Stage I   #clusters in Stage II   Overall Accuracy
DTAGnet-EML   4                      4                       0.2827
              8                      8                       0.2813
              12                     12                      0.2802
              20                     20                      0.2757
DTAGnet-MML   4                      4                       0.3030
              8                      8                       0.3006
              12                     12                      0.2927
              20                     20                      0.2805


To answer this important question, we vary the target
cluster number in either $C_1(Z_1, \omega_1)$ or $C_2(Z_2, \omega_2)$, and
re-conduct the experiments. Table 3 reveals that more auxiliary
tasks, even those without any straightforward task-specific
interpretation (e.g., partitioning the Multi-PIE subset into
4, 8, 12 or 20 clusters hardly makes semantic sense), may
still help gain better performances. It is comprehensible that
they simply promote more discriminative feature learning in
a low-to-high, coarse-to-fine scheme. In fact, it is a
complementary observation to the conclusion found in classification
[?]. On the other hand, at least in this specific case, when
the target cluster numbers of auxiliary tasks get closer to
the ground-truth (5 and 6 here), the models seem to achieve
the best performances. We conjecture that when properly
“matched”, every hidden representation in each layer is in
fact most suited for clustering the attributes corresponding
to the layer of interest. The whole model resembles
the problem of sharing low-level feature filters among several
relevant high-level tasks in convolutional networks [?],
but in a distinct context.

We hence conclude that the deeply-supervised fashion
proves helpful for deep clustering models, even
when there are no explicit attributes for constructing a practically
meaningful hierarchical clustering problem. However,
it is preferable to exploit those attributes when available, as
they lead to not only superior performances but also more clearly
interpretable models. The learned intermediate features can
be potentially utilized for multi-task learning [?].

6 Conclusion 

In this paper, we present a deep learning-based clustering 
framework. Trained from end to end, it features a task- 
specific deep architecture inspired by the sparse coding 
domain expertise, which is then optimized under clustering- 
oriented losses. Such a well-designed architecture leads to 
more effective initialization and training, and significantly 
outperforms generic architectures of the same parameter 
complexity. The model could be further interpreted and 
enhanced, by introducing auxiliary clustering losses to the 
intermediate features. Extensive experiments verify the 
effectiveness and robustness of the proposed models. 